Deviation coding of factor variables

last modified

2024–11–27

Deviation coding of factors (also known as “sum to zero contrasts”) as implemented in R’s contr-sum produces coefficients which represent the deviation under the given level from the mean of levels.

This interpretation is at first hard to reconcile with the contrasts contr.sum returns:

contr.sum(3)
  [,1] [,2]
1    1    0
2    0    1
3   -1   -1

They seem to code for the difference between the first \(l - 1\) levels against the last level, which seems to act as a reference level.


Let’s derive regressors whose coefficients represent the deviations from scratch. We start with a design matrix using dummy variables (or “treatment coding”) without a constant regressor.

(X <- rbind(diag(3), diag(3)))
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
[4,]    1    0    0
[5,]    0    1    0
[6,]    0    0    1

Estimating coefficients for these regressors amounts to calculating the mean over all values belonging to the level.

(y <- cbind(c(1, 6, 5, 3, 2, 11)))
     [,1]
[1,]    1
[2,]    6
[3,]    5
[4,]    3
[5,]    2
[6,]   11
pinv(X)
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]  0.5  0.0  0.0  0.5  0.0  0.0
[2,]  0.0  0.5  0.0  0.0  0.5  0.0
[3,]  0.0  0.0  0.5  0.0  0.0  0.5
(b <- pinv(X) %*% y)
     [,1]
[1,]    2
[2,]    4
[3,]    8

Once we have these coefficients, calculating deviations is simple.

(C <- eye(3) - ones(3) / 3)
           [,1]       [,2]       [,3]
[1,]  0.6666667 -0.3333333 -0.3333333
[2,] -0.3333333  0.6666667 -0.3333333
[3,] -0.3333333 -0.3333333  0.6666667
(d <- C %*% b)
           [,1]
[1,] -2.6666667
[2,] -0.6666667
[3,]  3.3333333

So how would we have to set up regressors to get deviations directly as coefficients? We have
\[ d = C ~ b \]
and
\[ Y = X ~ b + \epsilon, \]
and we want
\[ Y = X_d ~ d + \epsilon = X_d ~ C ~ b + \epsilon, \]
and therefore for the new design matrix it needs to hold
\[ X = X_d ~ C. \]
\(C\) is a square matrix, but it is rank deficient because rows and columns add up to zero, and it is therefore non-invertible. We can use the pseudoinverse,
\[ X C^- \]
but the result is rank deficient, too. What \(X C^-\) is missing in comparison to \(X\) is the ability to model the mean. That can be remedied by adding a constant regressor, but the resulting matrix is still rank deficient.

Instead, we take the pseudoinverse of C with the last row removed

pinv(C[-3, ])
              [,1]          [,2]
[1,]  1.000000e+00  1.110223e-16
[2,]  1.665335e-16  1.000000e+00
[3,] -1.000000e+00 -1.000000e+00

which is identical to the result of contr.sum(3) apart from rounding errors.


After adding the constant regressor, the resulting design matrix is

(Xcs <- cbind(ones(6, 1), X %*% pinv(C[-3, ])))
     [,1]          [,2]          [,3]
[1,]    1  1.000000e+00  1.110223e-16
[2,]    1  1.665335e-16  1.000000e+00
[3,]    1 -1.000000e+00 -1.000000e+00
[4,]    1  1.000000e+00  1.110223e-16
[5,]    1  1.665335e-16  1.000000e+00
[6,]    1 -1.000000e+00 -1.000000e+00

and it estimates the mean as well as the deviations from the mean for the first two levels:

pinv(Xcs) %*% y
           [,1]
[1,]  4.6666667
[2,] -2.6666667
[3,] -0.6666667

Therefore contr.sum indeed configures a factor for estimation of deviations from the mean, albeit for one less than the number of levels. The deviation for the last level can be obtained as the negative sum of the other deviations.
What appeared as a “reference level” is just the left-out level.

contr.sum does not provide the ability to specify the left-out level. memisc:contr.sum has that ability, and also creates better contrast names.

data <- data.frame(
  y = c(1, 6, 5, 3, 2, 11),
  f = factor(c("A", "B", "C", "A", "B", "C"))
)
contrasts(data$f) <- memisc::contr.sum(levels(data$f), base ="A")
contrasts(data$f)
   B  C
A -1 -1
B  1  0
C  0  1
lm(
  y ~ f,
  data = data
)

Call:
lm(formula = y ~ f, data = data)

Coefficients:
(Intercept)           fB           fC  
     4.6667      -0.6667       3.3333  

There is actually no need to leave out a level. If coefficients are estimated using the pseudoinverse, the mean and the deviations can be obtained together.

(Xd <- cbind(ones(6, 1), X %*% pinv(C)))
     [,1]       [,2]       [,3]       [,4]
[1,]    1  0.6666667 -0.3333333 -0.3333333
[2,]    1 -0.3333333  0.6666667 -0.3333333
[3,]    1 -0.3333333 -0.3333333  0.6666667
[4,]    1  0.6666667 -0.3333333 -0.3333333
[5,]    1 -0.3333333  0.6666667 -0.3333333
[6,]    1 -0.3333333 -0.3333333  0.6666667
pinv(Xd) %*% y
           [,1]
[1,]  4.6666667
[2,] -2.6666667
[3,] -0.6666667
[4,]  3.3333333

Unfortunately, contrasts(data$f) <- pinv(C) doesn’t work.